1. Kaggle Kernel Analysis: NYC Taxi EDA

Kernel name: “NYC Taxi EDA - Update: The fast & the curious”
Link: https://www.kaggle.com/headsortails/nyc-taxi-eda-update-the-fast-the-curious

1.1 Introduction

In this project we explore the kernel above: we show why the author chose the functions used, compare them with other alternatives, and emphasize the interesting points of the kernel.

1.2 Load data

Original code

library("data.table")
library("tibble")

train <- as.tibble(fread('train.csv'))

Comparison with other alternatives

In R, read.csv ships with the base installation and is used to load a data.frame from a CSV file. But when we're dealing with a huge data.frame, this function can take a long time to run.
So in this part the author used a function called fread (from the data.table package) that performs much faster than read.csv (compare the time of each function using profvis).
Another function worth comparing is load. This function restores variables that have been stored in a .RData file and can run very fast compared with read.csv and fread.
When is it a good idea to use load? When it's possible to use a background process to update the data.frame and save it in a .RData file.
Let's profile these options, together with read_csv from readr and the author's fread + as.tibble combination:

library("profvis")
library("data.table")
library("tibble")
library("readr")

profvis({
  # fread
  train <- fread("train.csv")
  # read.csv
  train_readcsv <- read.csv("train.csv")
  # read_csv -> from "readr" package
  train_read_csv <- read_csv("train.csv")
  # fread + as.tibble
  train_tibble <- as.tibble(fread("train.csv"))
  # loading RData
  save(train_readcsv, file = "train_data.RData")
  rm(train_readcsv)
  load(file = "train_data.RData")
})
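
The load workflow described above can be sketched as a small caching helper. This is a minimal sketch, not part of the kernel: the function name load_cached, the object name cached_data, and the temporary file paths are all hypothetical, and the toy CSV stands in for train.csv. It uses only base R.

```r
# Hypothetical helper (not from the kernel): reuse a .RData cache when it
# exists, otherwise parse the CSV once and write the cache for next time.
load_cached <- function(csv_path, cache_path) {
  if (file.exists(cache_path)) {
    env <- new.env()
    load(cache_path, envir = env)  # restores objects under their saved names
    return(env$cached_data)
  }
  cached_data <- read.csv(csv_path, stringsAsFactors = FALSE)
  save(cached_data, file = cache_path)
  cached_data
}

# Toy demonstration with temporary files standing in for train.csv
csv_path <- tempfile(fileext = ".csv")
cache_path <- tempfile(fileext = ".RData")
toy <- data.frame(id = c("a", "b"), trip_duration = c(455L, 663L))
write.csv(toy, csv_path, row.names = FALSE)

first <- load_cached(csv_path, cache_path)   # parses the CSV and writes the cache
second <- load_cached(csv_path, cache_path)  # restored from the .RData file
all.equal(first, second)
```

Loading into a separate environment avoids silently overwriting variables in the workspace, which plain load(file) would do.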

Tibbles vs data frames

All the information below was taken from https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html
Tibbles
“Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors).”

Major points:

  • It never changes an input’s type (i.e., no more stringsAsFactors = FALSE!);
  • It never adjusts the names of variables (e.g., names with spaces keep the whitespace, whereas data.frame replaces whitespace with ‘.’);
  • When you print a tibble, it only shows the first ten rows and all the columns that fit on one screen;
  • Tibbles are quite strict about subsetting. [] always returns another tibble. Contrast this with a data frame: sometimes [] returns a data frame and sometimes it just returns a vector.
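
The naming and subsetting points above can be checked on a toy example. This is a small sketch that assumes the tibble package is installed; the column names are made up for illustration. The factor-conversion difference is not shown because data.frame stopped converting strings to factors by default in R 4.0.0.

```r
library("tibble")

# The same two columns built as a data.frame and as a tibble
df <- data.frame(`trip id` = c("a", "b"), n = 1:2)
tb <- tibble(`trip id` = c("a", "b"), n = 1:2)

names(df)  # data.frame adjusts the name: "trip.id" "n"
names(tb)  # the tibble keeps it:         "trip id" "n"

# Selecting a single column with [
is.data.frame(df[, "n"])  # FALSE: the data.frame result drops to a vector
is_tibble(tb[, "n"])      # TRUE: the tibble result is still a tibble
```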

1.3 File structure and content

A brief overview of our data summarises the descriptive statistics of the dataset and helps detect abnormal items or outliers.

Summary statistics

summary(train)
##       id              vendor_id     pickup_datetime    dropoff_datetime  
##  Length:1458644     Min.   :1.000   Length:1458644     Length:1458644    
##  Class :character   1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character   Median :2.000   Mode  :character   Mode  :character  
##                     Mean   :1.535                                        
##                     3rd Qu.:2.000                                        
##                     Max.   :2.000                                        
##  passenger_count pickup_longitude  pickup_latitude dropoff_longitude
##  Min.   :0.000   Min.   :-121.93   Min.   :34.36   Min.   :-121.93  
##  1st Qu.:1.000   1st Qu.: -73.99   1st Qu.:40.74   1st Qu.: -73.99  
##  Median :1.000   Median : -73.98   Median :40.75   Median : -73.98  
##  Mean   :1.665   Mean   : -73.97   Mean   :40.75   Mean   : -73.97  
##  3rd Qu.:2.000   3rd Qu.: -73.97   3rd Qu.:40.77   3rd Qu.: -73.96  
##  Max.   :9.000   Max.   : -61.34   Max.   :51.88   Max.   : -61.34  
##  dropoff_latitude store_and_fwd_flag trip_duration    
##  Min.   :32.18    Length:1458644     Min.   :      1  
##  1st Qu.:40.74    Class :character   1st Qu.:    397  
##  Median :40.75    Mode  :character   Median :    662  
##  Mean   :40.75                       Mean   :    959  
##  3rd Qu.:40.77                       3rd Qu.:   1075  
##  Max.   :43.92                       Max.   :3526282
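
As an example of reading outliers off this summary, the trip_duration column can be converted to more intuitive units and checked against a plausibility window. This is only a sketch: the five values are the min, quartiles, and max copied from the summary above, and the thresholds min_secs and max_secs are our own assumption, not something the kernel defines.

```r
# Values copied from the trip_duration column of the summary (in seconds):
# min, 1st quartile, median, 3rd quartile, max
trip_duration <- c(1, 397, 662, 1075, 3526282)

# The maximum expressed in days: roughly 40.8, clearly not a single taxi ride
max(trip_duration) / 86400

# Assumed plausibility window (hypothetical thresholds): 1 minute to 24 hours
min_secs <- 60
max_secs <- 24 * 3600
suspicious <- trip_duration < min_secs | trip_duration > max_secs

trip_duration[suspicious]  # the 1-second and the 3526282-second values
```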

Data overview

library("dplyr")
glimpse(train)
## Observations: 1,458,644
## Variables: 11
## $ id                 <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id          <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime    <chr> "2016-03-14 17:24:55", "2016-06-12 00:43:35...
## $ dropoff_datetime   <chr> "2016-03-14 17:32:30", "2016-06-12 00:54:38...
## $ passenger_count    <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude   <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude    <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude  <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude   <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration      <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...

Comparison with other alternatives

Another popular way to get an overview of the data is str. It is very similar to glimpse, but str shows fewer values per column, whereas glimpse prints as many values as fit on the screen.

str(train)
## Classes 'data.table' and 'data.frame':   1458644 obs. of  11 variables:
##  $ id                : chr  "id2875421" "id2377394" "id3858529" "id3504673" ...
##  $ vendor_id         : int  2 1 2 2 2 2 1 2 1 2 ...
##  $ pickup_datetime   : chr  "2016-03-14 17:24:55" "2016-06-12 00:43:35" "2016-01-19 11:35:24" "2016-04-06 19:32:31" ...
##  $ dropoff_datetime  : chr  "2016-03-14 17:32:30" "2016-06-12 00:54:38" "2016-01-19 12:10:48" "2016-04-06 19:39:40" ...
##  $ passenger_count   : int  1 1 1 1 1 6 4 1 1 1 ...
##  $ pickup_longitude  : num  -74 -74 -74 -74 -74 ...
##  $ pickup_latitude   : num  40.8 40.7 40.8 40.7 40.8 ...
##  $ dropoff_longitude : num  -74 -74 -74 -74 -74 ...
##  $ dropoff_latitude  : num  40.8 40.7 40.7 40.7 40.8 ...
##  $ store_and_fwd_flag: chr  "N" "N" "N" "N" ...
##  $ trip_duration     : int  455 663 2124 429 435 443 341 1551 255 1225 ...
##  - attr(*, ".internal.selfref")=<externalptr>

(A)uthor and (O)ur comments

  • vendor_id only takes the values 1 or 2, presumably to differentiate two taxi companies (A)
    We can easily check this with: (O)
levels(as.factor(train$vendor_id))
## [1] "1" "2"
  • pickup_datetime and (in the training set) dropoff_datetime are combinations of date and time that we will have to re-format into a more useful shape (A)
  • passenger_count takes a median of 1 and a maximum of 9 in both data sets (A)
  • The pickup/dropoff_longitude/latitude columns describe the geographical coordinates where the meter was activated/deactivated (A)
  • store_and_fwd_flag is a flag that indicates whether the trip data was sent immediately to the vendor (“N”) or held in the memory of the taxi because there was no connection to the server (“Y”). Maybe there could be a correlation with certain geographical areas with bad reception? (A)
  • trip_duration: our target feature in the training data is measured in seconds. (A)
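
The re-formatting of pickup_datetime and dropoff_datetime mentioned above can be sketched in base R. This is our own illustration, not the author's code: the timestamps are the first two rows shown in the glimpse output, and the time zone is set to "UTC" as a simplifying assumption. The difference between dropoff and pickup reproduces the trip_duration values of those rows, which confirms that the target is measured in seconds.

```r
# First two rows of pickup_datetime / dropoff_datetime from the glimpse output
pickup_chr  <- c("2016-03-14 17:24:55", "2016-06-12 00:43:35")
dropoff_chr <- c("2016-03-14 17:32:30", "2016-06-12 00:54:38")

# Parse the character columns into date-time objects (tz = "UTC" is an assumption)
fmt <- "%Y-%m-%d %H:%M:%S"
pickup  <- as.POSIXct(pickup_chr,  format = fmt, tz = "UTC")
dropoff <- as.POSIXct(dropoff_chr, format = fmt, tz = "UTC")

# Duration in seconds: matches trip_duration (455 and 663) for these rows
duration_secs <- as.numeric(difftime(dropoff, pickup, units = "secs"))
duration_secs

# Once parsed, components such as the pickup hour are easy to extract
pickup_hour <- as.integer(format(pickup, "%H"))
pickup_hour
```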